Abstract

In this paper, we consider delay minimization for interference networks with a renewable energy
source, where the transmission power of a node comes from both the conventional utility power (AC
power) and the renewable energy source. We assume the transmission power of each node is a function
of the local channel state, local data queue state and local energy queue state only. We
consider two delay optimization formulations, namely the decentralized partially observable Markov
decision process (DEC-POMDP) and the non-cooperative partially observable stochastic game (POSG).
In the DEC-POMDP formulation, we derive a decentralized online learning algorithm to determine
the control actions and Lagrangian multipliers (LMs) simultaneously, based on the policy gradient
approach. Under some mild technical conditions, the proposed decentralized policy gradient algorithm
converges almost surely to a local optimal solution. On the other hand, in the non-cooperative POSG
formulation, the transmitter nodes are non-cooperative. We extend the decentralized policy gradient
solution and establish the technical proof for almost-sure convergence of the learning algorithms. In
both cases, the solutions are very robust to model variations. Finally, the delay performance of the
proposed solutions is compared with conventional baseline schemes for interference networks, and
it is illustrated that substantial delay performance gain and energy savings can be achieved.
I. Introduction

Recently, there has been intense research interest in interference channels. In [1], [2],
the authors show that interference alignment (using infinite dimension symbol extension in time or
frequency selective fading channels) can achieve the optimal degrees of freedom (DoF), and the total
capacity of the K-user interference channel is given by $\frac{K}{2}\log(\mathrm{SNR}) + o(\log(\mathrm{SNR}))$. In [3], [4],
the authors consider joint beamforming to minimize the weighted sum MMSE or maximize the
SINR of K-pair MIMO interference channels using optimization approaches. In [5], [6], the authors
consider decentralized beamforming design for MIMO interference networks using non-cooperative
games and study the sufficient conditions for the existence and convergence of the Nash equilibrium
(NE). However, all of these works assume that there are infinite backlogs at the transmitters
and focus on maximizing the physical layer throughput. In practice, applications are delay
sensitive, and it is critical to optimize the delay performance in the interference network.

A design framework that takes both queueing delay and physical layer performance into consideration
is not trivial, as it involves both queueing theory (to model the queueing dynamics) and information
theory (to model the physical layer dynamics) [7]. The simplest approach is to convert the delay
constraints into an equivalent average rate constraint using tail probability (large deviation theory),
and solve the optimization problem using a purely information theoretical formulation based on the
equivalent rate constraint [8]. However, the control policy thus derived is a function of the channel state
information (CSI) only, and it fails to exploit data queue state information (DQSI) in the adaptation
process. The Lyapunov drift approach is also widely used in the literature [9] to study the queue
stability region of different wireless systems and to establish the throughput optimal control policy
(in the stability sense). A systematic approach to delay-optimal resource control in the general
delay regime is based on the Markov decision process (MDP) technique [3], [10], [11]. However,
the brute-force solution of an MDP is usually very complex (owing to the curse of dimensionality) and
its extension to multi-flow problems in interference networks is highly non-trivial.

Another interesting dimension that has been ignored by most of the above works is the inclusion
of a renewable energy source at the transmit nodes. For instance, there is intense research interest in
exploiting renewable energy in communication network designs [12]-[15]. In [12], [13], the authors
presented an optimal energy management policy for a solar-powered device that uses a sleep and
wake up strategy for energy conservation in wireless sensor networks. In [14], the authors developed
a solar energy prediction algorithm to estimate the amount of energy harvested by solar panels to
deploy power-efficient task management methods on solar energy-harvested wireless sensor nodes.
In [15], the author proposed a power management scheme under the assumption that the harvested
energy satisfies performance constraints at the application layer. However, in all these works, the
delay requirements of applications have been completely ignored. Furthermore, the renewable energy
source can act as a low cost supplement to the conventional utility power source in communication
networks. Yet, there are various technical challenges regarding delay optimal design for interference
networks with a renewable energy source.

• Randomness of Renewable Energy Source: Recent developments in hardware design have
made energy harvesting possible in wireless communication networks [16], [17]. For example,
solar-powered base stations are available from various telecommunication vendors [17].
While the renewable energy source may appear to be completely free, there are various challenges
involved in fully capturing its advantage. For instance, renewable energy sources are random
in nature and energy storage is needed to buffer the unstable supply of renewable energy. Yet,
the cost of energy storage depends heavily on the associated capacity [18]. For limited capacity
energy storage, the transmission power allocation should be adaptive to the CSI, the DQSI as well
as the energy queue state information (EQSI). The CSI, DQSI and EQSI provide information
regarding the transmission opportunity, the urgency of the data flows, and the available renewable
energy, respectively. It is highly non-trivial to strike a balance among these factors in the
optimization.

• Decentralized Delay Minimization: The existing works on throughput or DoF optimization
in the interference network [1]-[6] require global knowledge of the CSI, which leads to heavy
backhaul signaling overhead and high computational complexity for the central controller. For
delay minimization with a renewable energy source, the entire system state is characterized by
the global CSI (CSI from any transmitter to any receiver), the global DQSI (data queue length
of all users), and the global EQSI (energy queue length of all users). Therefore, the centralized
solution (which requires global CSI, DQSI and EQSI) will also induce substantial signaling
overhead, which is not practical. It is desirable to have decentralized control based on local
observations only. However, due to the partial observation of the system state in decentralized
designs, existing solutions of the MDP approach cannot be applied to our problem.

• Algorithm Convergence Issue: In conventional iterative solutions for deterministic network
utility maximization (NUM) problems, the updates in the iterative algorithms (such as subgradient
search) are performed within the coherence time of the CSI (i.e., the CSI remains quasi-static
during the iteration updates) [1], [6]. When we consider delay minimization, the problem is
stochastic and the control actions are defined over ergodic realizations of the system states
(CSI, DQSI and EQSI). Furthermore, the restriction of partial observation of system states in
decentralized control further complicates the problem. As a result, the convergence proof of the
decentralized stochastic algorithm is highly non-trivial.

In this paper, we consider delay minimization for interference networks with a renewable energy
source. The transmitters are capable of harvesting energy from the environment, and the transmission
power of a node comes from both the conventional utility power (AC power) and the renewable
energy source. For decentralized control, we assume the transmission power of each node is adaptive
to the local system states only, namely the local CSI (LCSI), the local DQSI (LDQSI) and the local
EQSI (LEQSI). We consider two delay optimization formulations, namely the decentralized partially
observable MDP (DEC-POMDP), which corresponds to a cooperative stochastic game setup (where
the users cooperatively share a common system utility), and the non-cooperative partially observable
stochastic game (POSG), which corresponds to a non-cooperative stochastic game setup (where
each user has a different (and selfish) utility). In the DEC-POMDP formulation, the transmitters are fully
cooperative and we derive a decentralized online learning algorithm to determine the control actions
and the Lagrangian multipliers (LMs) simultaneously based on the policy gradient approach [11],
[19]. Under some mild technical conditions, the proposed decentralized policy gradient algorithm
converges almost surely to a local optimal solution. On the other hand, in the non-cooperative POSG
formulation, the transmitters are non-cooperative¹, and we extend the decentralized policy gradient
algorithm and establish the technical proof for almost-sure convergence of the learning algorithms. In
both cases, the solutions do not require explicit knowledge of the CSI statistics, the random data source
statistics or the renewable energy statistics. Therefore, the solutions are very robust to model
variations. Finally, the delay performance of the proposed solutions is compared with conventional
baseline schemes for interference networks, and it is illustrated that substantial delay performance
gain and energy savings can be achieved by incorporating the CSI, DQSI and EQSI in the power
control design.

II. System Model 

We consider K-pair interference channels sharing a common spectrum with bandwidth $W$ Hz, as
illustrated in Fig. 1. Specifically, each transmitter maintains a data queue for the random traffic flow
towards the desired receiver in the system. Furthermore, the transmitters are fixed base stations but
the receivers can be mobile. The time dimension is partitioned into scheduling frames (each lasting
$\tau$ seconds). In the following subsections, we shall elaborate the physical layer model, the random
data source model as well as the renewable energy source model.

¹ Non-cooperative nodes means that each transmitter optimizes its own utility in a selfish manner.
A. Physical Layer Model



The signal received at the k-th receiver is given by:

$$Y_k = \sqrt{P_k^{tx} L_{kk} H_{kk}}\, X_k + \sum_{n \neq k} \sqrt{P_n^{tx} L_{kn} H_{kn}}\, X_n + Z_k \tag{1}$$

where $L_{kn}$ and $H_{kn}$ are the long-term path loss and the microscopic channel fading gain, respectively,
from the n-th transmitter to the k-th receiver, $P_k^{tx}$ is the total transmission power of the k-th transmitter,
$X_n$ is the information symbol sent by the n-th transmitter, and $Z_k$ is the additive white Gaussian
noise with variance $N_0$. For notational convenience, we define the global CSI as $\mathbf{H} = \{H_{kn}, \forall k, n\}$.
The assumption on the channel model is given as follows.

Assumption 1 (Channel Model): We assume that the global CSI $\mathbf{H}$ is quasi-static in each frame.
Furthermore, $H_{kn}(t)$ is i.i.d. over the scheduling frames according to a general distribution $\Pr\{H_{kn}\}$
with $\mathbb{E}[H_{kn}] = 1$, and $H_{kn}$ is independent w.r.t. $\{n, k\}$. The path loss $L_{kn}$ remains constant for the
duration of the communication session. ■

Given transmission powers $\{P_k^{tx}\}$, the transmit data rate is given by:

$$R_k = W \log_2\left(1 + \xi\, \frac{P_k^{tx} L_{kk} H_{kk}}{N_0 + \sum_{n \neq k} P_n^{tx} L_{kn} H_{kn}}\right) \tag{2}$$

where $\xi \in (0, 1]$ is a constant. Note that (2) can be used to model both uncoded and coded systems
[20]. For example, $\xi = 0.5$ for QAM constellation at BER = 1% and $\xi = 1$ for capacity achieving
coding (in which case (2) corresponds to the instantaneous mutual information).
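As an illustration of the rate model, the following minimal Python sketch evaluates the per-link rate under the assumption that it takes the usual SINR form $W \log_2(1 + \xi \cdot \mathrm{SINR}_k)$, with the noise power entering the denominator directly. The function and argument names (`sinr`, `data_rate`, `path_loss`, `fading`) are hypothetical and chosen only for this example.

```python
import math


def sinr(k, p_tx, path_loss, fading, noise_power):
    """SINR at receiver k.

    path_loss[k][n] and fading[k][n] are the power gains from transmitter n
    to receiver k (L_kn and H_kn in the text). Assumed form, not the
    authors' code.
    """
    signal = p_tx[k] * path_loss[k][k] * fading[k][k]
    interference = sum(
        p_tx[n] * path_loss[k][n] * fading[k][n]
        for n in range(len(p_tx))
        if n != k
    )
    return signal / (noise_power + interference)


def data_rate(k, p_tx, path_loss, fading, noise_power, bandwidth_hz, xi=1.0):
    """Rate in bit/s, assuming the form W * log2(1 + xi * SINR)."""
    if not 0.0 < xi <= 1.0:
        raise ValueError("xi must lie in (0, 1]")
    gamma = sinr(k, p_tx, path_loss, fading, noise_power)
    return bandwidth_hz * math.log2(1.0 + xi * gamma)
```

With two links, unit transmit powers, unit direct gains, cross gains of 0.5 and noise power 0.5, the SINR of link 0 is 1, so `data_rate` returns exactly the bandwidth when `xi = 1`.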

B. Random Data Source Model and Data Queue Dynamics 

Let $\mathbf{A}(t) = \{A_1(t), \dots, A_K(t)\}$ be the random new arrivals (number of bits) at the K transmitters
at the end of the t-th scheduling frame.

Assumption 2 (Random Data Source Model): The arrival process $A_k(t)$ is i.i.d. over the scheduling
frames and is distributed according to a general distribution $\Pr\{A_k\}$ with average arrival rate
$\lambda_k = \mathbb{E}[A_k]$. Furthermore, the random arrival process $\{A_k\}$ is independent w.r.t. k. ■

Let $\mathbf{Q}(t) = \{Q_1(t), \dots, Q_K(t)\}$ denote the global DQSI in the system, where $Q_k(t)$ represents
the number of bits in the queue of transmitter k at the beginning of frame t. $N_Q$ denotes the maximal
buffer size (number of bits) of user k. When the buffer is full, i.e., $Q_k = N_Q$, new bit arrivals will
be dropped. The cardinality of the global DQSI is $I_Q = (1 + N_Q)^K$. Given a new arrival $A_k(t)$ at the
end of frame t, the queue dynamics of transmitter k is given by:

$$Q_k(t+1) = \Big[\big[Q_k(t) - R_k(t)\tau\big]^+ + A_k(t)\Big]^{\wedge N_Q} \tag{3}$$

where $R_k(t)$ is the achievable data rate for receiver k at frame t given in (2), $[x]^+ = \max(x, 0)$, and
$[x]^{\wedge N_Q} = \min(x, N_Q)$.
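The data queue recursion can be sketched in a few lines of Python. This is only an illustration of the update "serve first, then admit arrivals, then clip at the buffer size"; the function name `update_data_queue` and its argument names are hypothetical.

```python
def update_data_queue(queue_bits, rate_bps, arrivals_bits, tau_s, buffer_bits):
    """One-frame data queue update.

    Bits served in the frame are rate_bps * tau_s; the queue cannot go
    negative, and arrivals beyond the buffer size are dropped.
    """
    remaining = max(queue_bits - rate_bps * tau_s, 0.0)
    return min(remaining + arrivals_bits, buffer_bits)
```

For example, a queue of 100 bits served at 30 bit/s for 2 s with 20 arriving bits ends the frame with 60 bits, while a large burst of arrivals is clipped at the buffer size.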
C. Power Consumption Model with Renewable Energy Source

The transmission power of each node comes from both the AC power source and the renewable
energy source. Specifically, the transmitter is assumed to be capable of harvesting energy from the
environment, e.g., using solar panels [17], [21]. However, the amount of harvestable energy in a frame
is random. Let $\mathbf{X}(t) = \{X_1(t), \dots, X_K(t)\}$ be the harvestable energy (Joule) by the K transmitters
during the t-th scheduling frame. Note that the harvestable energy $\mathbf{X}(t)$ can be interpreted as the
energy arrival at the t-th frame.

Assumption 3 (Random Renewable Energy Model): The random process $\{X_k(t)\}$ is i.i.d. over
the scheduling frames and is distributed according to a general distribution $\Pr\{X_k\}$ with mean
renewable energy $\bar{X}_k = \mathbb{E}[X_k]$. Furthermore, the random process $\{X_k\}$ is independent w.r.t. k. ■

Let $\mathbf{E}(t) = \{E_1(t), \dots, E_K(t)\}$ denote the global EQSI in the system, where $E_k(t)$ represents the
renewable energy level at the energy storage of the k-th transmitter at the beginning of frame t. Let
$N_E$ denote the maximum energy queue buffer size (i.e., energy storage capacity in Joule) of user
k. When the energy buffer is full, i.e., $E_k = N_E$, additional energy cannot be harvested. Given an
energy arrival of $X_k(t)$ at the end of frame t, the energy queue dynamics of transmitter k is given
by:

$$E_k(t+1) = \Big[\big[E_k(t) - P_{k,e}(t)\tau\big]^+ + X_k(t)\Big]^{\wedge N_E} \tag{4}$$

where $P_{k,e}(t)$ is the renewable power consumption, which must satisfy the following energy-availability
constraint²:

$$P_{k,e}(t)\,\tau \le E_k(t), \quad \forall k. \tag{5}$$
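The energy queue behaves like the data queue, with the additional requirement that the renewable draw in a frame cannot exceed the stored energy. A minimal Python sketch follows; it assumes the harvested energy is clipped at the storage capacity, and the name `update_energy_queue` and its arguments are hypothetical.

```python
def update_energy_queue(energy_j, p_renewable_w, harvested_j, tau_s, capacity_j):
    """One-frame energy storage update with the energy-availability check.

    The renewable energy drawn in the frame is p_renewable_w * tau_s and
    must not exceed the stored energy. Harvested energy beyond the storage
    capacity is lost (assumed clipping).
    """
    drawn_j = p_renewable_w * tau_s
    if drawn_j > energy_j:
        raise ValueError("renewable draw exceeds the stored energy")
    return min(max(energy_j - drawn_j, 0.0) + harvested_j, capacity_j)
```

Raising an error on an infeasible draw is a design choice for this sketch; a control policy that respects the constraint never triggers it.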

The power consumption is contributed by not only the transmission power of the power amplifier
(PA) but also the circuit power of the RF chains (such as the mixers, synthesizers and digital-to-analog
converters). Furthermore, the circuit power $P_{cct}$ is constant irrespective of the transmission
data rate. Therefore, the total power consumption of user k at the t-th frame is given by

$$P_k(t) = P_k^{tx}(t) + P_{cct} \cdot \mathbf{1}\big(P_k^{tx}(t) > 0\big) \tag{6}$$

Note that in practice, due to the random nature of the renewable energy and the limited renewable
energy storage capacity, it can be used only as a supplementary form of power rather than completely
replacing the AC utility power. To support a total power consumption of $P_k(t)$, we can have power
circuitry [12], [13] to control the contributions from the AC utility $P_{k,ac}(t)$ as well as the renewable
energy storage $P_{k,e}(t)$, as illustrated in Fig. 1. This is similar in concept to hybrid cars, where the
power is contributed by both the gas engine and the battery. As a result, the total power consumption
$P_k(t)$ is given by $P_k(t) = P_{k,ac}(t) + P_{k,e}(t)$. Given $P_{k,ac}(t)$ and $P_{k,e}(t)$, the transmission power
$P_k^{tx}(t)$ is given by:

$$P_k^{tx}(t) = \Big(P_{k,ac}(t) + P_{k,e}(t) - P_{cct} \cdot \mathbf{1}\big(P_k^{tx}(t) > 0\big)\Big)^+ \tag{7}$$

² We consider a discrete time system with fixed time step $\tau$. Hence, $E_k(t)$ represents the energy level at the renewable
energy storage of the k-th transmitter at the beginning of frame t, and $P_{k,e}(t)\tau$ is the renewable energy consumption. As a
result, $P_{k,e}(t)\tau$ (energy consumed from the renewable energy storage) cannot be larger than $E_k(t)$ (total energy available
from the renewable energy storage).
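The relation between the AC and renewable contributions, the circuit power and the resulting transmission power can be illustrated with a short Python sketch. It uses the simplification that the transmission power is whatever remains after the circuit power is paid, and zero otherwise; the function names are hypothetical.

```python
def transmission_power(p_ac, p_renewable, p_cct):
    """Power left for the power amplifier after the circuit power is paid.

    Simplified reading of the model: if the supplied power does not cover
    the circuit power, nothing is transmitted.
    """
    return max(p_ac + p_renewable - p_cct, 0.0)


def total_power(p_tx, p_cct):
    """Total consumption: circuit power is only spent when transmitting."""
    return p_tx + (p_cct if p_tx > 0 else 0.0)
```

When the supplied power exceeds the circuit power, the two functions are consistent: `total_power(transmission_power(p_ac, p_e, p_cct), p_cct)` returns `p_ac + p_e`.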



III. Delay Optimal Power Control 
A. Control Policy and Resource Constraints 

We define $\chi = \{\mathbf{H}, \mathbf{Q}, \mathbf{E}\}$ as the global system state, and $\chi_k = \{\{H_{kn}, \forall n\}, Q_k, E_k\}$ as the local
system state for the k-th transmit node, where $\{H_{kn}, \forall n\}$ is the LCSI³, $Q_k$ is the LDQSI and $E_k$
is the LEQSI. Based on the local system state $\chi_k$, transmitter k determines the power consumption
$\mathbf{P}_k = \{P_{k,ac} \in \mathcal{A}_{ac}, P_{k,e} \in \mathcal{A}_e\}$ using a control policy defined below, where $\mathcal{A}_{ac} = \{a_{ac}^1, \dots, a_{ac}^N\}$
and $\mathcal{A}_e = \{a_e^1, \dots, a_e^N\}$ are the AC power allocation space and the renewable power allocation space
(both with cardinality N), respectively.

Definition 1 (Stationary Randomized Decentralized Power Control Policy): A stationary randomized
power control policy for user k, $\Omega_k : \chi_k \to \mathcal{P}(\mathcal{A}_{ac}, \mathcal{A}_e)$, is a mapping from the local system
state $\chi_k$ to a probability distribution over the power allocation space $\{\mathcal{A}_{ac}, \mathcal{A}_e\}$, i.e., $\Omega_k(\chi_k) = \mathbf{p} =
\{p_{1,1}, \dots, p_{N,N}\} \in \mathcal{P}(\mathcal{A}_{ac}, \mathcal{A}_e)$, where $\mathcal{P}(\mathcal{A}_{ac}, \mathcal{A}_e) = \{\mathbf{p} : \sum_{i,j} p_{i,j} = 1 \text{ and } p_{i,j} \ge 0, \forall i, j\}$ is the
space of joint probability distributions over the power allocations, and $p_{i,j}$ denotes the probability of
transmission powers $\{P_{k,ac} = a_{ac}^i, P_{k,e} = a_e^j\}$. ■

For simplicity, denote the joint control policy as $\Omega = \{\Omega_k, \forall k\}$. Note that the power allocation
policy $\Omega_k$ should satisfy the energy-availability constraint given in (5), i.e., given $\chi_k = \{H_{kk}, Q_k, E_k\}$,
the probability $p_{i,j}$ of transmission powers $\{P_{k,ac} = a_{ac}^i, P_{k,e} = a_e^j\}$ satisfies

$$p_{i,j} = 0, \quad \text{if } a_e^j > E_k/\tau. \tag{8}$$

Furthermore, $\Omega_k$ should meet the requirement of circuit power $P_{cct}$ consumption, i.e.,

$$p_{i,j} = 0, \quad \text{if } 0 < a_{ac}^i + a_e^j < P_{cct}. \tag{9}$$

Finally, $\Omega_k$ should also satisfy the per-user average AC power consumption constraint:

$$\bar{P}_k(\Omega) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\big[P_{k,ac}(t)\big] \le P_k^0, \tag{10}$$

³ We denote the local CSI at the k-th transmit node as $\{H_{kn}, \forall n\}$. However, in practice, the k-th transmit node only
needs to observe $H_{kk}$ and the total interference $\sum_{n \neq k} P_n^{tx} L_{kn} H_{kn}$.



February 13, 2012 



DRAFT 




where the expectation in (10) is taken w.r.t. the induced probability measure from the policy $\Omega$.

Remark 1 (Formulation with two optimization variables $\{P_{ac}, P_e\}$): While the "reward" of the system
dynamics (the transmission rate in (2)) depends on the total transmission power only, it does
not mean the problem can be formulated with just one variable (total transmission power). We also
have to look at the "cost" side. While the total power consumption is $P_{total} = P_{ac} + P_e$, $P_{ac}$ and $P_e$
have different cost structures (and different constraints) as in (10) and (5), respectively. Hence, the
problem with $P_{ac}$ and $P_e$ as variables cannot be transformed or reduced into a problem with $P_{total}$
as the only variable (due to the constraints). ■



B. Parametrization of Control Policy and Dynamics of System State 

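In this paper, we consider the parameterized stationary randomized policy, which is widely used in
the literature [19], [22]-[24]. Specifically, the randomized policy $\Omega_k$ can be parameterized by $\Theta_k$. For
example, when a local system state realization $\chi_k$ is observed, the power consumption of transmit
node k is $\mathbf{P}_k = \{P_{k,ac}, P_{k,e}\}$ with probability $\mu_{\chi_k}(\Theta_k, \mathbf{P}_k)$ given by [23]:

$$\mu_{\chi_k}(\Theta_k, \mathbf{P}_k) = \begin{cases} \dfrac{\exp\big(\theta_{\chi_k, \mathbf{P}_k}\big)}{\sum_{i,j} \exp\big(\theta_{\chi_k, (a_{ac}^i, a_e^j)}\big)\, \mathbf{1}\big(a_e^j \le E_k/\tau\big)} & \text{if } P_{k,e} = P_{k,ac} = 0, \text{ or } P_{k,e} \le E_k/\tau \text{ and } P_{k,e} + P_{k,ac} \ge P_{cct} \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$

where $\mathbf{1}(\cdot)$ is the indicator function, and $\Theta_k = \{\theta_{\chi_k, \mathbf{P}_k} \in \mathbb{R}, \forall \chi_k, \mathbf{P}_k\}$. As a result, the control policy
$\Omega_k$ is now parameterized by $\Theta_k$ and is denoted by $\Omega_k^{\Theta_k}$. Another possible parameterization is to use a
neural network [19], [22], where the probability is given by

$$\mu_{\chi_k}(\Theta_k, \mathbf{P}_k) = \begin{cases} \dfrac{\exp\big(\theta_{k1} + \sum_{i=2}^{a} \theta_{ki} f_{ki}(\chi_k, \mathbf{P}_k)\big)}{\sum_{i,j} \exp\big(\theta_{k1} + \sum_{m=2}^{a} \theta_{km} f_{km}(\chi_k, (a_{ac}^i, a_e^j))\big)\, \mathbf{1}\big(a_e^j \le E_k/\tau\big)} & \text{if } P_{k,e} = P_{k,ac} = 0, \text{ or } P_{k,e} \le E_k/\tau \text{ and } P_{k,e} + P_{k,ac} \ge P_{cct} \\ 0 & \text{otherwise,} \end{cases} \tag{12}$$

where $\Theta_k = \{\theta_{ki} \in \mathbb{R}, i = 1, \dots, a\}$ is the parameter and $f_{ki}(\chi_k, \mathbf{P}_k)$ is the prior basis function.
Note that the dimension of the parameter $\Theta_k$ is reduced to $a$ in this case.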
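The tabular parameterization can be illustrated with a small Python sketch of a masked softmax over the joint (AC, renewable) power levels for one local state. One assumption is made explicit here: the normalization runs over all feasible actions, i.e. those satisfying both the energy-availability and the circuit-power requirements, so that the probabilities sum to one. All names (`is_feasible`, `action_probabilities`, `sample_action`) are hypothetical.

```python
import math
import random


def is_feasible(a_ac, a_e, energy_stored, tau, p_cct):
    """Feasibility of a joint power level in the current local state."""
    if a_e * tau > energy_stored:        # cannot draw more than is stored
        return False
    total = a_ac + a_e
    return total == 0 or total >= p_cct  # either idle or circuit power covered


def action_probabilities(theta, ac_levels, e_levels, energy_stored, tau, p_cct):
    """Masked softmax over action index pairs (i, j).

    theta[i][j] is the preference of action (ac_levels[i], e_levels[j])
    for the observed local state; infeasible actions get probability zero
    and are simply left out of the returned dictionary.
    """
    pairs = [
        (i, j)
        for i in range(len(ac_levels))
        for j in range(len(e_levels))
        if is_feasible(ac_levels[i], e_levels[j], energy_stored, tau, p_cct)
    ]
    if not pairs:
        raise ValueError("no feasible action; include the zero power level")
    shift = max(theta[i][j] for i, j in pairs)  # for numerical stability
    weights = {(i, j): math.exp(theta[i][j] - shift) for i, j in pairs}
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items()}


def sample_action(probs, rng=random):
    """Draw one action index pair according to the policy probabilities."""
    pairs = list(probs)
    return rng.choices(pairs, weights=[probs[p] for p in pairs], k=1)[0]
```

With power levels `[0, 1]` for both sources, stored energy 0.5, unit frame length and circuit power 0.5, only the idle action and the AC-only action are feasible, so equal preferences give each probability 0.5.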

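For a given stationary parameterized control policy $\Omega^{\Theta}$ ($\Theta = \{\Theta_k, \forall k\}$), the induced random
process is a controlled Markov chain with transition probability

$$\Pr\{\chi(t+1) \mid \chi(t), \Omega^{\Theta}(\chi(t))\} = \Pr\{\mathbf{H}(t+1)\}\, \Pr\{\mathbf{Q}(t+1), \mathbf{E}(t+1) \mid \chi(t), \Omega^{\Theta}(\chi(t))\}, \tag{13}$$

where the joint data and energy queue transition probability is given by

$$\Pr\{\mathbf{Q}(t+1), \mathbf{E}(t+1) \mid \chi(t), \Omega^{\Theta}(\chi(t))\} = \begin{cases} \prod_k \Pr\{A_k(t)\}\, \Pr\{X_k(t)\}\, \mu_{\chi_k}(\Theta_k, \mathbf{P}_k(t)) & \text{if } E_k(t+1) = \tilde{E}_k \text{ and } Q_k(t+1) = \tilde{Q}_k,\ \forall k \\ 0 & \text{otherwise,} \end{cases} \tag{14}$$

where $\tilde{E}_k = \big[[E_k(t) - P_{k,e}(t)\tau]^+ + X_k(t)\big]^{\wedge N_E}$, $\tilde{Q}_k = \big[[Q_k(t) - R_k(t)\tau]^+ + A_k(t)\big]^{\wedge N_Q}$, and $R_k(t)$
is the achievable data rate of receiver k given in (2) under the power allocation $\mathbf{P} = \{\mathbf{P}_k(t), \forall k\}$.
Note that it is not sufficient to specify the evolution of the joint process $(\chi_1(t), \dots, \chi_K(t))$ by just
describing the measure of the individual local processes $\chi_k(t)$. This is because the individual state processes
$\chi_k(t)$ are not independent and are mutually coupled.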

Given a unichain policy $\Omega^{\Theta}$, the induced Markov chain is ergodic and there exists a unique
steady state distribution $\pi^{\Theta}$, where $\pi^{\Theta}(\chi) = \lim_{t \to \infty} \Pr[\chi(t) = \chi]$ [11]. The average delay utility
of user k, under a unichain policy $\Omega^{\Theta}$, is given by:

$$\bar{T}_k(\Theta) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega^{\Theta}}\big[f(Q_k(t))\big], \tag{15}$$

where $f(Q_k)$ is a monotonic increasing utility function of $Q_k$. For example, when $f(Q_k) = Q_k/\lambda_k$,
using Little's Law, $\bar{T}_k(\Theta)$ is the average delay⁴ of user k. When $f(Q_k) = \mathbf{1}(Q_k > Q_k^o)$, $\bar{T}_k(\Theta)$
is the queue outage probability⁵. Since $\lambda_k$ is a constant, the average delay $\bar{T}_k(\Theta)$ is proportional to the
average queue length.
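The two example utilities can be estimated from a recorded queue trajectory by replacing the expectation with a time average. The Python sketch below is only an empirical illustration; the function names and the convention that the arrival rate is measured in bits per frame (so the delay is in frames) are assumptions of this example.

```python
def average_delay_frames(queue_samples_bits, arrival_rate_bits_per_frame):
    """Little's Law estimate: mean queue length divided by the arrival rate."""
    mean_queue = sum(queue_samples_bits) / len(queue_samples_bits)
    return mean_queue / arrival_rate_bits_per_frame


def queue_outage_probability(queue_samples_bits, threshold_bits):
    """Fraction of frames in which the queue exceeds the threshold."""
    exceed = sum(1 for q in queue_samples_bits if q > threshold_bits)
    return exceed / len(queue_samples_bits)
```

For queue samples of 10, 20 and 30 bits and an arrival rate of 5 bits per frame, the estimated average delay is 4 frames.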

C. Problem Formulation 

Note that the stochastic dynamics of the K data queues and energy queues are coupled together
via the control policy $\Omega^{\Theta}$. In this paper, we consider two different decentralized control problems:

1) DEC-POMDP Problem: In this case, all the transmitter nodes are cooperative and we seek to
find an optimal stationary control policy $\Omega^{\Theta}$ to minimize a common weighted sum delay utility in
(15). Since the control policy $\Omega_k^{\Theta_k}$ is only a function of the local system state $\chi_k$, the problem is a
partially observed MDP, which is summarized below:

Problem 1 (Delay Optimal DEC-POMDP): For some positive constants $\beta = \{\beta_k, \forall k\}$, find a
stationary control policy $\Omega^{\Theta}$ that minimizes:

$$\begin{aligned} \min_{\Theta}\ & \bar{T}_{\beta}(\Theta) = \sum_k \beta_k \bar{T}_k(\Theta) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega^{\Theta}}\big[g(\chi(t), \Omega^{\Theta}(\chi(t)))\big] \\ \text{subject to}\ & \bar{P}_k(\Omega^{\Theta}) = \bar{P}_k(\Theta) \le P_k^0,\ \forall k, \\ & E_k \le N_E,\ \forall k \end{aligned} \tag{16}$$

where $g(\chi(t), \Omega^{\Theta}(\chi(t))) = \sum_k \beta_k f(Q_k)$ is the joint per-stage utility. The positive constants $\beta$
indicate the relative importance of the users, and for the given $\beta$, the solution to (16) corresponds to
a Pareto optimal point of the multi-objective optimization problem: $\min_{\Theta} \bar{T}_k(\Theta), \forall k$. ■

⁴ Since the buffer size is finite, $\bar{T}_k(\Theta)$ is the average delay when $f(Q_k) = Q_k/(\lambda_k(1 - P_{loss}))$, where $P_{loss}$ is the
packet drop rate due to buffer overflow. However, in practice our target $P_{loss} \ll 1$, and hence $f(Q_k) \approx Q_k/\lambda_k$ is a
good approximation for the average delay. Furthermore, this approximation is asymptotically tight as the data buffer size
increases. In practice, the approximation error will not be significant since the system will have a reasonable $P_{loss}$ (e.g.
0.1%).

⁵ The probability that the queue state exceeds a threshold $Q_k^o$, i.e., $\Pr\{Q_k > Q_k^o\}$.

Note that the average AC power constraint is commonly used in many existing studies [1], [10]
and is very relevant in practice (because the electric bill is charged by average AC power consumption
× time of usage). The motivation of Problem 1 is to optimize the delay performance under an average
cost constraint (AC power) by fully utilizing the free renewable energy. Problem 1 is also equivalent
to minimizing the average AC power consumption subject to an average delay constraint because they
have the same Lagrangian function.

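2) Non-Cooperative POSG Problem: In this case, the K transmitter nodes are non-cooperative
and we formulate the delay utility minimization problem as a non-cooperative partially observable
stochastic game (POSG), in which user k competes against the others by choosing his power
allocation policy $\Omega_k^{\Theta_k}$ to minimize his average delay utility selfishly. Specifically, the non-cooperative
POSG is formulated as Problem 2.

Problem 2 (Delay Optimal Non-Cooperative POSG): For transmitter k, we try to find a stationary
control policy $\Omega_k^{\Theta_k}$ that minimizes:

$$\begin{aligned} \min_{\Theta_k}\ & \bar{T}_k(\Theta_k, \Theta_{-k}) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega^{\Theta_k}, \Omega^{\Theta_{-k}}}\big[f(Q_k(t))\big] \\ \text{subject to}\ & \bar{P}_k(\Theta_k, \Theta_{-k}) \le P_k^0,\ \forall k, \\ & E_k \le N_E,\ \forall k \end{aligned} \tag{17}$$

where $\Theta_{-k} = \{\Theta_j, j \neq k\}$, and $\Omega^{\Theta_{-k}} = \{\Omega_j^{\Theta_j}, j \neq k\}$ denotes the policies of all users except the
k-th user. ■

The local equilibrium solutions of the non-cooperative POSG (17) are formally defined as follows.

Definition 2 (Local Equilibrium of Non-Cooperative POSG): A profile of the power allocation policy
$\Omega^{\Theta^*} = \{\Omega_1^{\Theta_1^*}, \dots, \Omega_K^{\Theta_K^*}\}$ is the local equilibrium of the game (17) if it satisfies the following
fixed point equations for some $\gamma^* = \{\gamma_k^* \ge 0, \forall k\}$,

$$\nabla_{\Theta_k} \psi_k(\Theta_k^*, \Theta_{-k}^*, \gamma_k^*) = 0 \quad \text{and} \quad \bar{P}_k(\Theta_k^*, \Theta_{-k}^*) - P_k^0 \le 0, \quad \gamma_k^*\big(\bar{P}_k(\Theta_k^*, \Theta_{-k}^*) - P_k^0\big) = 0, \quad \forall k$$

where $\psi_k(\Theta_k, \Theta_{-k}, \gamma_k) = \bar{T}_k(\Theta_k, \Theta_{-k}) + \gamma_k\big(\bar{P}_k(\Theta_k, \Theta_{-k}) - P_k^0\big)$. ■

Remark 2 (Interpretation of the Local Equilibrium): $\psi_k(\Theta_k, \Theta_{-k}, \gamma_k)$ can be regarded as the Lagrange
function for user k (given the policies of the other users $\Theta_{-k}$) in the non-cooperative POSG
problem (17). From the Lagrangian theory [25], a local equilibrium $\Omega^{\Theta^*} = \{\Omega_1^{\Theta_1^*}, \dots, \Omega_K^{\Theta_K^*}\}$ means
that given $\Omega^{\Theta_{-k}^*}$, $\Omega_k^{\Theta_k^*}$ is the local optimal solution for the non-cooperative POSG problem (17). ■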

Remark 3 (Comparison between the DEC-POMDP and Non-Cooperative POSG Problems): In Problem 1
(DEC-POMDP), the controller is decentralized at the K transmitters and they have access to
the local system state only. Yet, the K controllers are fully cooperative in the sense that they are
designed to optimize a common objective function where the per-stage utility is assumed to be known
globally through message passing. As a result, they interact in a decentralized cooperative manner.
On the other hand, in the non-cooperative POSG formulation, the K controllers are non-cooperative
in the sense that each controller is interested in optimizing its own delay utility function. Hence, they
interact in a decentralized non-cooperative manner. ■

Note that the policies $\{\Omega_k, \forall k\}$ are reactive or memoryless in that their choice of action is based
only upon the current local observation. Furthermore, the DEC-POMDP and the non-cooperative
POSG problems are NP-hard [26]. Instead of targeting global optimal solutions, we shall derive
low complexity iterative algorithms for local optimal solutions in the following sections.

IV. Decentralized Solution for DEC-POMDP 

In this section, we shall propose a decentralized online policy gradient update algorithm to find
a local optimal solution for problem (16). The proposed solution has low complexity and does not
require explicit knowledge of the CSI statistics, the random data source statistics or the renewable
energy statistics.

A. Decentralized Stochastic Policy Gradient Update 

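We first define the Lagrangian function of problem (16) as

$$\psi(\Theta, \gamma) = \sum_k \Big(\beta_k \bar{T}_k(\Theta) + \gamma_k\big(\bar{P}_k(\Theta) - P_k^0\big)\Big), \tag{18}$$

where $\gamma = \{\gamma_k \in \mathbb{R}^+, \forall k\}$ is the LM vector w.r.t. the average power constraint for all the users.
The local optimal solution $\Theta^*$ for problem (16) should satisfy the following first-order necessary
conditions given by [25]

$$\begin{aligned} \nabla_{\Theta} \psi(\Theta^*, \gamma^*) &= 0 \\ \gamma_k^*\big(\bar{P}_k(\Theta^*) - P_k^0\big) &= 0,\ \forall k \end{aligned} \tag{19}$$

Define a reference state⁶ $\{\mathbf{Q}^I, \mathbf{E}^I\} = \{\{Q_1^I, \dots, Q_K^I\}, \{E_1^I, \dots, E_K^I\}\}$. Using perturbation
analysis [11], [22], the gradient⁷ $\nabla_{\Theta} \psi(\Theta, \gamma)$ is given in the following lemma.

⁶ For example, we can set $\{Q_k^I = N_Q, E_k^I = N_E, \forall k\}$ without loss of optimality.

⁷ Note that a change of $\Theta$ will affect the function $\psi(\Theta, \gamma)$ via the probability measure behind the expectation in $\psi(\Theta, \gamma)$,
and hence deriving the gradient is highly non-trivial.

Lemma 1 (Gradient of the Lagrangian Function): The gradient of the Lagrangian function is given
by

$$\nabla_{\Theta_k} \psi(\Theta, \gamma) = \sum_{\chi} \sum_{\mathbf{P}} \pi(\chi; \Theta)\, \mu_{\chi}(\Theta, \mathbf{P})\, \frac{\nabla_{\Theta_k} \mu_{\chi_k}(\Theta_k, \mathbf{P}_k)}{\mu_{\chi_k}(\Theta_k, \mathbf{P}_k)}\, q(\chi, \mathbf{P}; \gamma, \Theta) \tag{20}$$

where $\pi(\chi; \Theta)$ is the steady state probability of state $\chi$ under the policy $\Omega^{\Theta}$, $\mu_{\chi}(\Theta, \mathbf{P}) = \prod_k \mu_{\chi_k}(\Theta_k, \mathbf{P}_k)$
is the probability that joint action $\mathbf{P}$ is taken, $\frac{\nabla_{\Theta_k} \mu_{\chi_k}(\Theta_k, \mathbf{P}_k)}{\mu_{\chi_k}(\Theta_k, \mathbf{P}_k)} \triangleq 0$ if $\mu_{\chi_k}(\Theta_k, \mathbf{P}_k) = 0$, and

$$q(\chi, \mathbf{P}; \gamma, \Theta) = \mathbb{E}^{\Omega^{\Theta}}\left[\sum_{t=0}^{T^I - 1} \big(g(\chi(t), \mathbf{P}(t)) - \psi(\Theta, \gamma)\big) \,\Big|\, \chi(0) = \chi, \mathbf{P}(0) = \mathbf{P}\right], \tag{21}$$

where $g(\chi, \mathbf{P}) = \sum_k \big(\beta_k f(Q_k) + \gamma_k(P_{k,ac} - P_k^0)\big)$, and $T^I = \min\{t > 0 \mid \mathbf{Q}(t) = \mathbf{Q}^I, \mathbf{E}(t) = \mathbf{E}^I\}$ is the
first future time that the reference state $\{\mathbf{Q}^I, \mathbf{E}^I\}$ is visited. ■

Proof: Please refer to Appendix A. ■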
Note that the brute force solution of (19) requires huge complexity and knowledge of the CSI
statistics, the random data source statistics and the renewable energy statistics. Based on Lemma
1, we shall propose a low complexity decentralized online policy gradient update algorithm to obtain
a solution of (19). Specifically, the key steps for decentralized online learning are given below.

• Step 1, Initialization: Each transmitter initializes the local parameter $\Theta_k$.

• Step 2, Per-user Power Allocation: At the beginning of the t-th frame, each transmitter
determines the transmission power allocation according to the policy $\Omega_k^{\Theta_k}$ based on the local
system state $\chi_k$, and transmits at the associated achievable data rate given in (2).

• Step 3, Message Passing among the K Transmitters⁸: At the end of the t-th frame, each
transmitter shares the per-user per-stage utility $g_k = \beta_k f(Q_k) + \gamma_k(P_{k,ac} - P_k^0)$ and the reference
state indication $\zeta_k$, where $\zeta_k = 1$ if $\{Q_k = Q_k^I, E_k = E_k^I\}$, and $\zeta_k = 0$ otherwise.

• Step 4, Per-user Parameter $\Theta_k$ Update: Based on the current local observation, each of the
transmitters updates the local parameter $\Theta_k$ according to Algorithm 1.

• Step 5, Per-user LM Update: Based on the current local observation, each of the transmitters
updates the local LMs $\{\gamma_k, \forall k\}$ according to Algorithm 1.
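The message passing in Step 3 can be sketched in Python as follows. Each transmitter forms a scalar per-stage utility and a one-bit reference state indication, and every transmitter then combines the K received messages by a sum and a product. The function names `local_message` and `aggregate` are hypothetical, and `delay_utility` stands for the value of $f(Q_k)$.

```python
def local_message(beta_k, delay_utility, gamma_k, p_ac, p_ac_budget, at_reference):
    """Per-user message: (per-stage utility g_k, reference indication zeta_k)."""
    g_k = beta_k * delay_utility + gamma_k * (p_ac - p_ac_budget)
    zeta_k = 1 if at_reference else 0
    return g_k, zeta_k


def aggregate(messages):
    """Combine the K messages: total utility and joint reference indication.

    The joint indication is 1 only if every user is in its reference state.
    """
    g_total = sum(g for g, _ in messages)
    zeta = 1
    for _, zeta_k in messages:
        zeta *= zeta_k
    return g_total, zeta
```

Only two numbers per user are exchanged per frame in this sketch, which is what keeps the signaling light compared with sharing the full CSI and queue states.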

⁸ Note that the per-user per-stage utility includes not only the packet buffer states but also the control action. As a result,
just broadcasting the nodes' buffer states is not enough to replace the per-user per-stage utility. Furthermore, if each user wants
to have complete state information, the users need to share both the buffer states and the CSI states. This would cause
much more information exchange than sharing the per-user per-stage utility. Table I summarizes the communication
overhead of exchanging the per-stage utility and of sharing the buffer states and the CSI states.

Fig. 2 illustrates the above procedure by a flowchart. The detailed algorithm for the local parameter
and LM updates in Step 4 and Step 5 is given below:

Algorithm 1 (Online Learning Algorithm for Per-user Parameter and LM): Let $\chi_k = \{H_{kk}, Q_k, E_k\}$
be the current local system state, $\mathbf{P}_k$ be the current realization of the power allocation, $g^t = \sum_k g_k^t$
be the current realization of the per-stage utility and $\zeta = \prod_k \zeta_k$ be the current realization of the
reference state indication. The online learning algorithm at the k-th transmitter is given by

$$\begin{aligned} \Theta_k^{t+1} &= \Theta_k^t - a(t)\,\big(g^t - \tilde{\psi}^t\big)\, z_k^t \\ \gamma_k^{t+1} &= \big[\gamma_k^t + b(t)\,(P_{k,ac} - P_k^0)\big]^+ \end{aligned} \tag{22}$$

where $\tilde{\psi}^{t+1} = \tilde{\psi}^t + a(t)\,(g^t - \tilde{\psi}^t)$ is the running estimate of the Lagrangian, and $z_k^t$ is the trace of the
score $\nabla_{\Theta_k} \mu_{\chi_k}(\Theta_k, \mathbf{P}_k) / \mu_{\chi_k}(\Theta_k, \mathbf{P}_k)$, which is restarted whenever the reference state is visited ($\zeta = 1$).
The stepsizes $\{a(t), b(t)\}$ are non-increasing positive scalars satisfying $\sum_t a(t) = \sum_t b(t) = \infty$,
$\sum_t \big(a^2(t) + b^2(t)\big) < \infty$, and $b(t)/a(t) \to 0$. ■
Remark 4 (Features of Learning Algorithm 1): The learning algorithm requires only local
observations, i.e., the local system state $\{H_{kk}, Q_k, E_k\}$ at each transmit node, and limited message
passing of $\{\zeta_k, g_{L,k}\}$, where the overhead is quite mild. Both the per-user parameter and the
LMs are updated simultaneously and distributively at each transmitter. Furthermore, the iteration is
online and proceeds on the same timescale as the CSI and QSI variations.
Finally, the solution does not require knowledge of the CSI distribution or the statistics of the arrival
process or renewable energy process, i.e., it is robust to model variations. ■
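The per-frame local update can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: the parameter and LM updates follow the form of (22), while the running-average rule for `L` and the trace rule for `z` (restart at the reference state, otherwise accumulate the score) are assumed here in the standard policy-gradient style. All function and argument names are hypothetical.

```python
import numpy as np


def local_update(theta, gamma, L, z, g_L, p_ac, p0, a_t, b_t, zeta, score):
    """One frame of a two-timescale update at a single transmitter.

    theta : local policy parameter vector
    gamma : local Lagrange multiplier (scalar, kept non-negative)
    L     : running estimate of the average per-stage utility (assumed rule)
    z     : trace vector multiplying the utility error (assumed rule)
    score : gradient of the log-policy at the realized action (assumed input)
    zeta  : 1 if all users are at the reference state, else 0
    """
    # Fast timescale: parameter step along the trace, scaled by the utility error.
    theta_new = theta - a_t * (g_L - L) * z
    # Slow timescale: projected LM step on the AC power constraint violation.
    gamma_new = max(0.0, gamma + b_t * (p_ac - p0))
    # Assumption: L tracks the average per-stage utility.
    L_new = L + a_t * (g_L - L)
    # Assumption: the trace restarts at the reference state, else accumulates.
    z_new = score.copy() if zeta == 1 else z + score
    return theta_new, gamma_new, L_new, z_new
```

The projection `max(0.0, ...)` plays the role of the $[\cdot]^+$ operator, so the LM stays at zero while the realized AC power is below the budget.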

B. Convergence Analysis 

In this section, we shall establish the convergence proof of the proposed decentralized learning
Algorithm 1. Since we have two different stepsize sequences $\{a(t)\}$ and $\{b(t)\}$ with $b(t) = o(a(t))$,
e.g., a{t) = jjjj and b{t) = ^. the per-user parameter updates and the LM updates are done 
simultaneously but over two different timescales. During the per-user parameter update (timescale
I), we have $\gamma_k^{t+1} - \gamma_k^t = O(b(t)) = o(a(t)), \forall k$. Therefore, the LMs appear to be quasi-static
during the per-user parameter update in (22), and the convergence analysis can be established over
the two timescales separately. We first have the following lemma.

Lemma 2 (Convergence of Per-user Parameter Learning (Timescale I)): The iterations of the
per-user parameter $\Theta^t$ in the proposed learning Algorithm 1 will converge almost surely to a stationary
point, i.e., $\lim_{t\to\infty} \Theta^t = \Theta^\infty(\gamma)$, and $\Theta^\infty(\gamma)$ satisfies

$$\nabla_\Theta \psi(\Theta^\infty(\gamma), \gamma) = 0. \qquad (24)$$

Proof: Please refer to Appendix B. ■
On the other hand, during the LM update (timescale II), we have $\lim_{t\to\infty} \|\Theta^t - \Theta^\infty(\gamma^t)\| = 0$ almost
surely. Hence, during the LM update in (22), the per-user parameter is seen as almost equilibrated.
The convergence of the LMs is summarized below.

Lemma 3 (Convergence of LM over Timescale II): The iterations of the LMs satisfy $\lim_{t\to\infty} \gamma^t = \gamma^\infty$
almost surely, where $\gamma^\infty$ satisfies the power constraints of all the users in (10). ■
Proof: Please refer to Appendix C. ■
Based on the above lemmas, we can summarize the convergence performance of the proposed
learning algorithm in the following theorem.

Theorem 1 (Convergence of Online Learning Algorithm 1): In learning Algorithm 1, we have
$(\Theta^t, \gamma^t) \to (\Theta^\infty, \gamma^\infty)$ almost surely, where $\Theta^\infty$ and $\gamma^\infty$ satisfy the KKT condition given in (19),
i.e.,

$$\nabla_\Theta \psi(\Theta^\infty, \gamma^\infty) = 0, \quad \gamma_k^\infty\big(\overline{P}_k(\Theta^\infty) - P_k^0\big) = 0, \qquad (25)$$

and the power constraints of all the users in (10). Furthermore, if $\nabla^2_{\Theta\Theta}\psi(\Theta^\infty, \gamma^\infty) \succ 0$ (positive
definite matrix), then $\Theta^\infty$ is a local optimal solution for the constrained DEC-POMDP problem in
(USl). ■ 
Note that $\nabla^2_{\Theta\Theta}\psi(\Theta^\infty, \gamma^\infty) \succ 0$ is a very mild condition that is usually satisfied [28].

V. Decentralized Solution for Non-Cooperative POSG Problem 

In this section, we shall propose a decentralized online policy gradient update algorithm to find
a local equilibrium of the non-cooperative POSG problem. The proposed solution also has low
complexity and does not require explicit knowledge of the CSI statistics, the random data source statistics,
or the renewable energy statistics.

A. Decentralized Stochastic Policy Gradient Update 

From (18), the Lagrangian function for user $k$ is given by

$$\psi_k(\Theta_k, \Theta_{-k}, \gamma_k) = \overline{T}_k(\Theta_k, \Theta_{-k}) + \gamma_k\big(\overline{P}_k(\Theta_k, \Theta_{-k}) - P_k^0\big), \qquad (26)$$

where $\gamma_k \in \mathbb{R}^+$ is the LM w.r.t. the average power constraint for user $k$. Following a similar perturbation
analysis as in Lemma 1, the gradient $\nabla_{\Theta_k}\psi_k(\Theta_k, \Theta_{-k}, \gamma_k)$ is given in the following lemma.

Lemma 4 (Gradient of the Lagrangian Function): The gradient of the Lagrangian function in (26)
is given by

VeMQk, @-k,lk) = Ex Ep Qhxi®, ^)^iSe^p^i'^(^^ ®)' (27) 

where

$$q_k(\chi, P; \gamma_k, \Theta) = \mathbb{E}^{\Theta}\Big[\sum_{t=0}^{T^I-1}\big(f(Q_k(t)) + \gamma_k(P_{k,ac}(t) - P_k^0) - \psi_k(\Theta_k, \Theta_{-k}, \gamma_k)\big)\,\Big|\,\chi(0) = \chi, P(0) = P\Big]. \qquad (28)$$
■

Based on Lemma 4, we shall propose a low-complexity decentralized online policy gradient
update algorithm to obtain a local equilibrium. Specifically, the key steps of the decentralized online
learning are given below.

• Step 1, Initialization: Each transmitter initializes the local parameter $\Theta_k$.

• Step 2, Per-user Power Allocation: At the beginning of the $t$-th frame, each transmitter
determines the transmission power allocation according to the policy $\mu_k^{\Theta_k}$ based on the local
system state $\chi_k$, and transmits at the associated achievable data rate.

• Step 3, Message Passing among the K Transmitters: At the end of the $t$-th frame, each
transmitter shares the one-bit reference state indication $\zeta_k$, where $\zeta_k = 1$ if $\{Q_k = Q_k^I, E_k = E_k^I\}$,
and $\zeta_k = 0$ otherwise.

• Step 4, Per-user Parameter Update: Based on the current local observation, each of the
transmitters updates the local parameter $\Theta_k$ according to Algorithm 2.

• Step 5, Per-user LM Update: Based on the current local observation, each of the transmitters
updates the local LMs $\{\gamma_k, \forall k\}$ according to Algorithm 2.
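The one-bit message passing in Step 3 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the queue and energy states are assumed to be plain integers. Each transmitter reports whether it sits at its own reference state, and the product of the $K$ bits indicates whether every user is at the reference state in the current frame.

```python
def reference_bit(q, e, q_ref, e_ref):
    """Local one-bit indication: 1 if (queue, energy) equals the reference state."""
    return 1 if (q == q_ref and e == e_ref) else 0


def global_indication(bits):
    """Product of the per-user bits: 1 only if all users are at the reference state."""
    zeta = 1
    for bit in bits:
        zeta *= bit
    return zeta


# Example with K = 3 users and reference state (0, 0) for each of them.
states = [(0, 0), (2, 1), (0, 0)]
bits = [reference_bit(q, e, 0, 0) for q, e in states]
zeta = global_indication(bits)  # 0 here, because user 2 is not at its reference state
```

Since only one bit per user is exchanged per frame, the signaling cost does not grow with the size of the buffer or channel state spaces.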

Fig. 3 illustrates the above procedure by a flowchart. The detailed algorithm for updating the local
parameters and LMs in Step 4 and Step 5 is given below:

Algorithm 2 (Online Learning Algorithm for Per-user Parameter and LM): Let $\chi_k = \{H_{kk}, Q_k, E_k\}$
be the current local system state, $P_k$ be the current realization of the power allocation, and $\zeta = \prod_k \zeta_k$ be
the current realization of the reference state indication. The online learning algorithm at the $k$-th
transmitter is given by

$$\Theta_k^{t+1} = \Theta_k^t - a(t)\,\big(f(Q_k) + \gamma_k^t(P_{k,ac} - P_k^0) - L_k^t\big)\, z_k^t, \qquad (29)$$

$$\gamma_k^{t+1} = \big[\gamma_k^t + b(t)\,\big(P_{k,ac} - P_k^0\big)\big]^+, \qquad (30)$$

where $L_k^t$ is updated with stepsize $a(t)$ from the term $f(Q_k) + \gamma_k^t(P_{k,ac} - P_k^0) - L_k^t$.

Remark 5 (Features of Learning Algorithm 2): The learning algorithm requires only local
observations, i.e., the local system state $\{H_{kk}, Q_k, E_k\}$ at each transmit node, and one-bit message passing
of $\zeta_k$. Both the per-user parameter and the LMs are updated simultaneously and distributively at
each transmitter. Furthermore, the iteration is online and proceeds on the same timescale as the CSI
and QSI variations. Finally, the solution does not require knowledge of the
CSI distribution or the statistics of the arrival process or renewable energy process, i.e., it is robust to model
variations. ■

B. Convergence Analysis

In this section, we shall establish the convergence proof of the proposed decentralized learning
Algorithm 2. Specifically, let $\eta = \max_{k, n \neq k} L_{kn}/L_{kk}$, and let $\mathcal{F}^* = \{\Theta^*\}$ be the set of local equilibria
of the game, i.e., $\Theta^*$ satisfies the fixed-point equations in (18). The convergence performance
of the proposed learning algorithm is given in the following theorem.

Theorem 2 (Convergence of Online Learning Algorithm 2): Suppose $\mathcal{F}^*$ is not empty. The
iterations of the per-user parameter $\Theta^t$ in the proposed learning Algorithm 2 will converge almost surely
to an invariant set given by

$$\mathcal{S}_\Theta \triangleq \{\Theta : \|\Theta - \Theta^*\|^2 - \delta \leq 0\} \qquad (31)$$

as $t \to \infty$, for some positive constant $\delta = O(\eta^2)$ and some $\Theta^* \in \mathcal{F}^*$.
Proof: Please refer to Appendix D.

Remark 6 (Interpretation of Theorem 2): From (31), the error between the converged solution $\Theta^\infty$
and the local equilibrium $\Theta^*$ of the POSG decreases in the order of $\eta^2$, where $\eta$ represents the degree
of coupling among the transmitters.

VI. Simulations

In this section, we shall compare the performance of the proposed decentralized solutions against
various existing decentralized baseline schemes.

• Baseline 1, Orthogonal Transmission: The transmissions between the K pairs are coordinated
using TDMA so that there is no interference among the users. Both the AC and renewable power
consumption are adaptive to LCSI and LEQSI only by optimizing the sum throughput as in [16].

• Baseline 2, LCSI and LEQSI Only Strategy: The K transmitters send data to their desired
receivers simultaneously, sharing the same spectrum. Both the AC and renewable power
consumption are adaptive to LCSI and LEQSI only by optimizing the sum throughput as in [16].

• Baseline 3, Greedy Strategy: The K transmitters send data to their desired receivers
simultaneously, sharing the same spectrum. The transmitters consume all the available renewable
energy at each frame (emptying the renewable energy buffer at each frame), and the AC
power consumption is adaptive to LCSI only by optimizing the sum throughput.

In the simulation, we consider a symmetric system in which the cross-link coefficient equals 0.1, $\forall k, n \neq k$, as in [6]. The
long-term path loss for the desired link is 15 dB, which corresponds to a cell size of 5.6 km [29]. The static
circuit power is $P_{cct} = 40$ (Watt) [30]. We assume Poisson packet arrival with average arrival rate $\lambda_k$
(packet/s) and exponentially distributed random packet size with mean $\overline{N}_k = 2$ Mbits. The scheduling
frame duration $\tau$ is 50 ms, and the total bandwidth is 1 MHz. The maximum data queue buffer size is
5 (packets). Furthermore, we consider Poisson energy arrival with average arrival rate $\overline{X}_k$ (Watt) as in
[16], and the renewable energy is stored in a 1.2V 20Ah lithium-ion battery. The AC power allocation
space and the renewable power allocation space are given by $\mathcal{A}_a = \mathcal{A}_e = [0, 300, 600, 900, 1200, 1500]$
(Watt). The average delay is considered as our utility ($f(Q_k) = Q_k/\lambda_k$), and the randomized policy
$\mu_k^{\Theta_k}$ uses the parameterized form introduced earlier.

Note that the proposed algorithm works for generic packet and renewable energy arrival models as depicted in Definition
2 and Definition 3. The Poisson model is used for simulation illustration only.

A. Delay Performance w.r.t. the AC Power Consumption

Fig. 4 illustrates the average delay per user versus the AC power consumption $P_k^0$. The average
data arrival rate is $\lambda_k = 1.1$, and the energy arrival rate is $\overline{X}_k = 800$. The average delay of all
the schemes decreases as the AC power consumption increases, and the proposed schemes achieve
significant performance gain over all the baselines. This gain is contributed by the DQSI- and EQSI-aware
dynamic power allocation. Furthermore, it can also be observed that the solution to the
non-cooperative POSG problem has similar performance to the solution to the DEC-POMDP problem.

B. Delay Performance w.r.t. Number of Power Control Levels

Fig. 5 illustrates the average delay per user versus the number of power control levels that lie
between 0 and 1.5 kW. The average data arrival rate is $\lambda_k = 1.1$, the energy arrival rate is $\overline{X}_k = 800$,
and the average AC power consumption is $P_k^0 = 800$. The average delay of the proposed schemes
decreases as the number of power control levels increases, yet the performance improvement is
marginal. It can also be observed that there is significant performance gain with the proposed schemes
compared with all the baselines, and the solution to the non-cooperative POSG problem has similar
performance to the solution to the DEC-POMDP problem.

C. Delay Performance w.r.t. Renewable Energy Buffer Size

Fig. 6 illustrates the average delay per user versus the renewable energy buffer size $N_E$. Specifically,
we consider lithium-ion batteries ranging from 1.2V 10Ah to 40Ah. The average data arrival rate is
$\lambda_k = 1.1$, the energy arrival rate is $\overline{X}_k = 800$, and the average AC power consumption is $P_k^0 = 500$.
It can be observed that the proposed schemes achieve significant performance gain over all the
baselines at any given renewable energy buffer size.

D. Convergence Performance

Fig. 7 illustrates the convergence property of the proposed schemes. We plot the randomized power
control policy $\mu_{\chi_1}(\Theta_1, P_1)$ versus the scheduling frame index for the POMDP and non-cooperative
POSG problems, respectively. The average data arrival rate is $\lambda_k = 1.1$, the energy arrival rate is
$\overline{X}_k = 800$, and the average AC power consumption is $P_k^0 = 1100$. It can be observed that the
convergence rate of the online algorithm is quite fast. For example, the delay performance of the
proposed schemes already outperforms all the baselines at the 2500-th scheduling frame. Furthermore,
the delay performance at the 2500-th scheduling frame is already quite close to the converged average
delay.

VII. Conclusion 

In this paper, we consider the decentralized delay minimization for interference networks with 
limited renewable energy storage. Specifically, the transmitters are capable of harvesting energy from 
the environment, and the transmission power of a node comes from both the conventional utility 
power (AC power) and the renewable energy source. We consider two decentralized delay optimization 
formulations, namely the DEC-POMDP and the non-cooperative POSG, where the control policy is 
adaptive to local system states (LCSI, LDQSI and LEQSI) only. In the DEC-POMDP formulation, 
the controllers interact in a cooperative manner and the proposed decentralized policy gradient 
solution converges almost surely to a local optimal point under some mild technical conditions. 
In the non-cooperative POSG formulation, the transmitter nodes are non-cooperative. We extend the 
decentralized policy gradient solution and establish the technical proof for almost-sure convergence of 
the learning algorithms. In both cases, the solutions are very robust to model variations. Finally, the
delay performance of the proposed solutions is compared with conventional baseline schemes for
interference networks, and it is illustrated that substantial delay performance gain and energy savings
can be achieved by incorporating the CSI, DQSI and EQSI in the power control design.

Appendix A
Proof of Lemma 1

From the perturbation analysis ifTTl . ll22ll in MDP, the gradient VoV'CqS given by 

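$$\nabla_\Theta \psi(\Theta) = \sum_{\chi} \pi(\chi; \Theta)\Big\{\nabla_\Theta g_L(\chi, \Theta) + \sum_{\chi'} \nabla_\Theta \Pr\{\chi'|\chi, \Theta\}\, V(\chi')\Big\}, \qquad (32)$$

where $V(\chi)$ satisfies the following Bellman (Poisson) equation

$$V(\chi) + \psi(\Theta) = g_L(\chi, \Theta) + \sum_{\chi'} \Pr\{\chi'|\chi, \Theta\}\, V(\chi'). \qquad (33)$$

Since $\sum_P \mu_\chi(\Theta, P) = 1$ for every $\Theta$, we have

$$\nabla_\Theta g_L(\chi, \Theta) = \sum_P \nabla_\Theta \mu_\chi(\Theta, P)\,\big(g_L(\chi, P) - \psi(\Theta)\big), \qquad \nabla_\Theta \Pr\{\chi'|\chi, \Theta\} = \sum_P \nabla_\Theta \mu_\chi(\Theta, P)\, \Pr\{\chi'|\chi, P\}. \qquad (34)$$

Substituting (34) into (32), we have

$$\nabla_\Theta \psi(\Theta) = \sum_\chi \sum_P \pi(\chi; \Theta)\, \mu_\chi(\Theta, P)\, \frac{\nabla_\Theta \mu_\chi(\Theta, P)}{\mu_\chi(\Theta, P)}\, q(\chi, P; \Theta), \qquad (35)$$

where

$$q(\chi, P; \Theta) = g_L(\chi, P) - \psi(\Theta) + \sum_{\chi'} \Pr\{\chi'|\chi, P\}\, V(\chi'). \qquad (36)$$

(The notation of $\gamma$ is omitted in this section for simplicity.)

Taking the conditional expectation (conditioned on $\{Q, E\}$) on both sides of (33), we have the following
equivalent Bellman equation

$$\tilde{V}(Q, E) + \psi(\Theta) = \mathbb{E}\big[g_L(\chi, \Theta)\,\big|\,Q, E\big] + \sum_{Q', E'} \mathbb{E}\big[\Pr\{Q', E'|\chi, \Theta\}\,\big|\,Q, E\big]\, \tilde{V}(Q', E'), \qquad (37)$$

where $\tilde{V}(Q, E) = \mathbb{E}[V(\chi)\,|\,Q, E]$. It can be verified that the following differential utility $\tilde{V}(Q, E)$ of
state $(Q, E)$ satisfies the above equivalent Bellman equation:

$$\tilde{V}(Q, E) = \mathbb{E}^{\Theta}\Big[\sum_{t=0}^{T^I-1}\big(g_L(\chi(t), \Theta) - \psi(\Theta)\big)\,\Big|\,Q(0) = Q, E(0) = E\Big], \qquad (38)$$

where $T^I = \min\{t > 0 \,|\, Q^t = Q^I, E^t = E^I\}$ is the first future time that the reference state $(Q^I, E^I)$ is
visited. Therefore, we have

$$q(\chi, P; \Theta) = g_L(\chi, P) - \psi(\Theta) + \sum_{Q', E'} \Pr\{Q', E'|\chi, P\}\, \tilde{V}(Q', E') = \mathbb{E}^{\Theta}\Big[\sum_{t=0}^{T^I-1}\big(g_L(\chi(t), P(t)) - \psi(\Theta)\big)\,\Big|\,\chi(0) = \chi, P(0) = P\Big], \qquad (39)$$

which finishes the proof.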

Appendix B
Proof of Lemma 2

In timescale I, we can rewrite the update equations in (22) as follows



r* + a(t)i?(x*,P*,r*^ 



where r = (0 , L*), and 



R{x\P\r'^ 



-{94x',F')-L')4 



-{9^{X','P')-L^) 

Define tm the m-th time that the recurrent state (Q^,E^) is visited, and we have 



where 5(m) = Et=t^~^ e"" = Ei=?^~^ a(i)(^(x*, P*, r*) - h{r^"^)), and 

-(■0(9*'") -L*-) 

where W{Q) = [Wi{Q), ■■■ , 1^^,(9)], and Wk{@) = E^" [ Et=ir'(Wi-^) "'^'^je^.p.) 
we shall show that the following holds almost surely. 

l2 



+ a(t)i?(x*,P*,r*) =r*'" +a(m)/i(r*'") + e" 



t=t„ 



h{r* 



(40) 



(41) 



(42) 



oo oo 

a(m) = oo, ^^[a(m)r < oo. 

m=l m=l 



(43) 



. Next 



(44) 
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Specifically, J2m=i '^{f^) = St^i '^(0 = Furthermore, since a{t) is non-increasing and 
(Q^,E^) is a recurrent state, we have 

lEE™=i[5("i)]'] < nE^=i a(^n^)'(Wl - tm?] = EE^=i a{t^fE{tm+i - tn.?] < oo. 

(45) 

Therefore, X]m=i['^("^)]^ finite expectation and is finite almost surely. Following the same way, 
it is easy to infer that has finite expectation and is finite almost surely, and hence e"^ 

converges almost surely. Since $\epsilon^m$ and $\tilde{a}(m)$ converge to zero almost surely, and $h(r^{t_m})$ is bounded,
we have

$$\lim_{m\to\infty}\big(r^{t_{m+1}} - r^{t_m}\big) = 0. \qquad (46)$$

Then, similar to the proof of [22, Lemma 11], we can show that $\psi(\Theta^{t_m})$ and $L^{t_m}$ converge to a
common limit. Since $\psi(\Theta^{t_m}) - L^{t_m}$ converges to zero, the algorithm of the per-user parameter update
is given by

0Wi ^ Qt^ +5(^)(VV;(G*") + e"') (47) 

where converges to zero and e"^ is a summable sequence. This is a gradient method with 
diminishing errors. Therefore, by following the same way as in |[28l and IIBTI . we can conclude 
that the learning algorithm will converge to an equilibrium $\Theta^\infty$ almost surely, given by

$$\nabla_\Theta \psi(\Theta^\infty, \gamma) = 0. \qquad (48)$$

Appendix C
Proof of Lemma 3

Due to the separation of timescales, the primal update of the per-user parameter can be regarded
as converged to $\Theta^*(\gamma^t)$ w.r.t. the current LMs $\gamma^t$. Specifically, for timescale II, we can rewrite the
update equations in (22) as follows

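$$\gamma_k^{t+1} = \Big[\gamma_k^t + b(t)\big(\overline{P}_k(\Theta^*(\gamma^t)) - P_k^0 + \underbrace{P_{k,ac} - \overline{P}_k(\Theta^*(\gamma^t))}_{w_k^{t+1}}\big)\Big]^+. \qquad (49)$$

Let $\mathcal{F}^t = \sigma(\gamma^l, w^l, l \leq t)$ be the $\sigma$-algebra generated by $\{\gamma^l, w^l, l \leq t\}$. Note that $\mathbb{E}[w_k^{t+1}|\mathcal{F}^t] = 0$,
and $\mathbb{E}[\|w^{t+1}\|^2|\mathcal{F}^t] \leq C_1(1 + \|\gamma\|)$ for a suitable constant $C_1$. Using the standard stochastic
approximation argument [28], the dynamics of the LM learning equation in (22) for user $k$ can be
represented by the following ordinary differential equation (ODE):

$$\dot{\gamma}_k(t) = \overline{P}_k(\Theta^*(\gamma(t))) - P_k^0, \qquad (50)$$

where $\Theta^*(\gamma(t))$ is the converged per-user parameter under the LM $\gamma(t)$. Define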

G(7) = ^(e*(7),7) = E (AT,(e*(7)) + 7k(Pk{Q*h)) - p^)) ■ (51) 

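By the chain rule and $\nabla_\Theta \psi(\Theta^*(\gamma), \gamma) = 0$, we have

$$\frac{\partial G(\gamma)}{\partial \gamma_k} = \nabla_\Theta \psi(\Theta^*(\gamma), \gamma)^T \frac{\partial \Theta^*(\gamma)}{\partial \gamma_k} + \frac{\partial \psi(\Theta^*(\gamma), \gamma)}{\partial \gamma_k} = \overline{P}_k(\Theta^*(\gamma)) - P_k^0.$$

Therefore, the ODE in (50) can be expressed as $\dot{\gamma}_k(t) = \partial G(\gamma(t))/\partial \gamma_k$.
As a result, the ODE in (50) will converge to either $\overline{P}_k(\Theta^*(\gamma)) - P_k^0 = 0$ or $\{\overline{P}_k(\Theta^*(\gamma)) - P_k^0 < 0, \gamma_k = 0\}$,
which satisfies the average power constraints in (10).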

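Appendix D
Proof of Theorem 2

Note that when $\eta = 0$, i.e., the interference is zero for each user, we have $\overline{T}_k(\Theta_k, \Theta_{-k}; \eta = 0) = \overline{T}_k(\Theta_k)$
and $\overline{P}_k(\Theta_k, \Theta_{-k}; \eta = 0) = \overline{P}_k(\Theta_k)$ in problem (17). In other words, the other users'
control policies $\mu_{-k}^{\Theta_{-k}}$ do not influence the average delay utility $\overline{T}_k$ and the AC power consumption
$\overline{P}_k$ for user $k$, since there is no interference. Specifically, denote $\overline{T}_k^0(\Theta_k) = \overline{T}_k(\Theta_k, \Theta_{-k}; \eta = 0)$ and
$\overline{P}_k^0(\Theta_k) = \overline{P}_k(\Theta_k, \Theta_{-k}; \eta = 0)$, and hence $\psi_k^0(\Theta_k, \gamma_k) = \psi_k(\Theta_k, \Theta_{-k}, \gamma_k; \eta = 0)$. From the
convergence analysis in Section IV-B, the per-user parameter and LM will converge to the equilibrium
point $\Theta^0 = \{\Theta_k^0\}$ given by

$$\nabla_{\Theta_k}\psi_k^0(\Theta_k^0, \gamma_k^0) = 0, \quad \nabla^2_{\Theta_k\Theta_k}\psi_k^0(\Theta_k^0, \gamma_k^0) \succ 0, \qquad (52)$$

and $\gamma_k^0$ satisfies the AC power constraint.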

Let $\Lambda_k = [\Theta_k; \gamma_k]$ and $\Lambda = [\Lambda_1, \cdots, \Lambda_K]$. From [28], we can rewrite the update algorithm as the
following ODE



A(t) = f{A{t)) 



-Ve,V'?(ei,7i) -Ve,<(0if,7i^) 



(53) 



Note that $f^0(\Lambda^0) = 0$, i.e., $\Lambda^0$ is the equilibrium point of the above ODE. The Jacobian matrix
$Df^0(\Lambda^0)$ at the equilibrium point $\Lambda^0$ is given by

^ -V|^e.V'?(ei,7i) -Ve,P?(ei) ^ 



D/0(A0^ 



Ve,P?(0i) 



\ Vq,Pk{Qk) / 

(54) 

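Since $\nabla^2_{\Theta_k\Theta_k}\psi_k^0(\Theta_k^0, \gamma_k^0) \succ 0$, it has been shown in [25] that all eigenvalues of $Df^0(\Lambda^0)$ have
strictly negative real parts. Therefore, $\Lambda^0$ is exponentially stable. By the converse Lyapunov theorem
[32], there exists a Lyapunov function $V(\Lambda)$ for $\dot{\Lambda}(t) = f^0(\Lambda(t))$, s.t. $C_1\|\Lambda - \Lambda^0\|^2 \leq V(\Lambda) \leq
C_2\|\Lambda - \Lambda^0\|^2$, and $\frac{\partial V}{\partial \Lambda} f^0(\Lambda) \leq -C_3\|\Lambda - \Lambda^0\|^2, \forall \Lambda$, for some positive constants $\{C_1, C_2, C_3\}$.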
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When 1] ^ 0, let 77^ = maxj j^, and from Taylor expansion we have 

^,(6,7.) = ^^(e„7.) + X;^^|f%^ + 0(,^) (55) 

~[ ^kk OLki/ Lkk 



1= 



Pk{Q,lk) = Pki®k,lk) + 2^j jn~Tr + OiVk) 

~[ J^kk CiLki/Lkk 



(56) 



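Therefore, we can rewrite the update algorithm as the following ODE

$$\dot{\Lambda}(t) = f^0(\Lambda(t)) + \epsilon(\Lambda(t)), \qquad (57)$$

where $\|\epsilon(\Lambda(t))\| = O(\eta)$. Then we have

$$\dot{V}(\Lambda) = \frac{\partial V}{\partial \Lambda}\dot{\Lambda} = \frac{\partial V}{\partial \Lambda}\big(f^0(\Lambda) + \epsilon(\Lambda)\big) \leq -C_3\|\Lambda - \Lambda^0\|^2 + 2C_2\|\Lambda - \Lambda^0\| \cdot \|\epsilon(\Lambda)\| = -\|\Lambda - \Lambda^0\|\big(C_3\|\Lambda - \Lambda^0\| - 2C_2\|\epsilon(\Lambda)\|\big). \qquad (58)$$

Note that $\dot{V}(\Lambda) < 0$ for all $\Lambda$ s.t. $(C_3)^2\|\Lambda - \Lambda^0\|^2 > 4(C_2)^2\|\epsilon(\Lambda)\|^2$, i.e., whenever
$\|\Lambda - \Lambda^0\|^2 > \delta$ for some $\delta = O(\eta^2)$. As a result,
$\Lambda^t$ converges almost surely to an invariant set given by $\mathcal{S} = \{\Lambda : \|\Lambda - \Lambda^0\|^2 - \delta \leq 0\}$. Furthermore,
from $\dot{V}(\Lambda^*) = 0$, we have $\|\Lambda^* - \Lambda^0\|^2 - \delta \leq 0$, and hence the invariant set is also given by
$\mathcal{S} = \{\Lambda : \|\Lambda - \Lambda^*\|^2 - \delta \leq 0\}$. Finally, we can conclude that the iterations of the per-user parameter
$\Theta^t$ in the proposed learning Algorithm 2 will converge almost surely to an invariant set given by

$$\mathcal{S}_\Theta \triangleq \{\Theta : \|\Theta - \Theta^*\|^2 - \delta \leq 0\} \qquad (59)$$

as $t \to \infty$, for some positive constant $\delta = O(\eta^2)$ and some $\Theta^* \in \mathcal{F}^*$.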
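The effect of the perturbation term in (57) can be illustrated on a toy example. The sketch below is not the system in the proof: it assumes a scalar, exponentially stable ODE $\dot{x} = -c\,x + \epsilon$ with a constant perturbation `eps`, integrated with a forward Euler step, and all names and numerical values are hypothetical. The unperturbed equilibrium is $x = 0$, and the perturbed trajectory settles at a distance $\epsilon/c$ from it, so the squared distance scales as $O(\epsilon^2)$.

```python
def simulate(c=1.0, eps=0.05, x0=2.0, dt=0.01, steps=5000):
    """Forward-Euler integration of dx/dt = -c * x + eps, returning the final x.

    c   : decay rate of the unperturbed, exponentially stable system (assumed)
    eps : constant perturbation, playing the role of the O(eta) coupling term
    """
    x = x0
    for _ in range(steps):
        x += dt * (-c * x + eps)
    return x


# Doubling the perturbation doubles the offset from the unperturbed equilibrium,
# so the squared offset grows by a factor of four.
offset_small = abs(simulate(eps=0.05))
offset_large = abs(simulate(eps=0.10))
```

In this toy case the trajectory does not return to the unperturbed equilibrium but stays inside a ball whose radius is proportional to the perturbation size, which mirrors the role of $\delta = O(\eta^2)$ in the invariant set above.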
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TABLE I 

Communication overhead comparison for exchanging the per-stage utility and sharing the buffer
states and the CSI states.

Fig. 1. System model. Each transmitter maintains a data queue for the random traffic flow towards the desired receiver
in the system. Furthermore, each transmit node is capable of harvesting energy from the environment and storing it in an
energy buffer.

Fig. 2. The system procedure of the decentralized per-user parameter and LM online learning algorithm for the DEC-POMDP
problem.

Fig. 3. The system procedure of the decentralized per-user parameter and LM online learning algorithm for the POSG
problem.

Fig. 4. Delay performance per user versus the AC power consumption $P_k^0$. The average data arrival rate is $\lambda_k = 1.1$
(packets per second), and the energy arrival rate is $\overline{X}_k = 800$ (Watt). The renewable energy is stored in a 1.2V 20Ah lithium-ion
battery.

Fig. 5. Delay performance per user versus the number of power control levels that lie between 0 and 1.5 kW. The average
data arrival rate is $\lambda_k = 1.1$ (packets per second), the energy arrival rate is $\overline{X}_k = 800$ (Watt), and the average AC power
consumption is $P_k^0 = 800$ (Watt). The renewable energy is stored in a 1.2V 20Ah lithium-ion battery.

Fig. 6. Delay performance per user versus the renewable energy buffer size $N_E$ (horizontal axis: battery size, ×10Ah). Specifically, we consider lithium-ion
batteries ranging from 1.2V 10Ah to 40Ah. The average data arrival rate is $\lambda_k = 1.1$ (packets per second), the energy arrival
rate is $\overline{X}_k = 800$ (Watt), and the average AC power consumption is $P_k^0 = 500$ (Watt).

Fig. 7. Convergence property of the proposed scheme (horizontal axis: scheduling frame index; panel (a): POMDP problem). The average data arrival rate is $\lambda_k = 1.1$ (packets per second), the
energy arrival rate is $\overline{X}_k = 800$ (Watt), and the average AC power consumption is $P_k^0 = 1100$ (Watt). The figure illustrates
the instantaneous randomized power control policy $\mu_{\chi_1}(\Theta_1, P_1)$ ($Q_1 = 2$, $E_1 = 0$) versus scheduling frame index for the
POMDP and POSG problems, respectively. The boxes indicate the average delay of various schemes at the selected frame
indices.
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